3.0 Exploring the Tidyverse

The tidyverse is a very popular and important collection of R packages that share a common design. Loading it into your session lets you work with “tidy” data and makes data frames in that format much easier to manipulate. Tidy data is any data frame or table where each row represents one observation and each column represents one variable measured on that observation (almost every data frame we have created up to this point counts as a tidy data frame). Many datasets are not in tidy format, and those must be reshaped into tidy format before you can manipulate them with these tools (we will cover how to do that in later lessons).
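As a minimal sketch of this definition, the code below builds the same four measurements twice: once in a layout that is not tidy and once in a tidy layout. The values and the column names country, year, and value are made up for illustration and do not come from any course dataset.

```r
# Hypothetical example data (made-up values)

# Not tidy: each year is its own column, so one row holds several observations
untidy <- data.frame(
  country = c("A", "B"),
  `2020`  = c(10, 20),
  `2021`  = c(11, 21),
  check.names = FALSE   # keep the year column names exactly as written
)

# Tidy: one row per country-year observation, one column per variable
tidy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2020, 2021, 2020, 2021),
  value   = c(10, 11, 20, 21)
)

untidy
tidy
```

Both objects hold the same information; only the tidy one has a single row for each observation.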

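Two built-in datasets are shown below. co2 is a time series (a vector of values with time attributes) rather than a data frame, so it is not in tidy format. BOD, by contrast, is a small data frame with one row per observation and one column per variable.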

data("co2")
head(co2)
## [1] 315.42 316.31 316.50 317.56 318.13 318.00
data("BOD")
head(BOD)
##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5   15.6
## 6    7   19.8
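
A quick way to inspect how each of these objects is stored is to check its class. The sketch below also shows one possible way to turn co2 into a data frame with one row per measurement; the object name co2_tidy and its column names are assumptions chosen for illustration.

```r
# co2 and BOD both ship with R in the datasets package
class(co2)          # "ts": a time series, not a data frame
class(BOD)          # "data.frame"
is.data.frame(co2)  # FALSE

# One possible tidy version of co2: one row per monthly measurement
co2_tidy <- data.frame(
  time = as.numeric(time(co2)),  # decimal year of each measurement
  co2  = as.numeric(co2)         # the measured value
)
head(co2_tidy)
```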

This section covers some of the core functions available once the tidyverse is loaded: mutate, filter, and select.

3.1 Mutating (adding columns)

The function mutate lets us add new columns with very little syntax. It takes the data frame we want to modify as the first argument, and the name and values of the new variable as the second argument, using the “name = values” format. We will practice adding a new variable to the data set below.

# install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(readr)
library(dplyr)
library(purrr)

# import data set from previous lesson
setwd("/Volumes/GoogleDrive-115381348121898517757/My Drive/All School Files/USC PHD/Files/Non-Class Material/UCLA Summer Course - Intro to Data Science/Datasets")

diabetes <- read_csv("diabetes.csv")
## Rows: 768 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, D...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# creating a new variable called "geriatric" using an ifelse function (1 if Age is over 35, otherwise 0)
diabetes_v2 <- mutate(diabetes, geriatric = ifelse(Age > 35, 1, 0))

# Traditional (base R) way of adding the same variable directly to the original dataset
diabetes$geriatric <- ifelse(diabetes$Age > 35, 1, 0)
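
As a further sketch of mutate, the example below uses the built-in mtcars data set, an assumption made so that the code runs without diabetes.csv. The column names kpl and efficient, the conversion factor, and the cutoff of 10 are made up for illustration. It shows that several columns can be added in one call, and that a later column can use one created earlier in the same call.

```r
library(dplyr)

cars_v2 <- mutate(mtcars,
                  kpl       = mpg * 0.425,             # approximate km per litre
                  efficient = ifelse(kpl > 10, 1, 0))  # uses the new kpl column

head(cars_v2[, c("mpg", "kpl", "efficient")])
```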

3.2 Filtering through data

Suppose that we want to filter the data table to show only the entries for which the BMI is 23 or higher. To do this we use the filter function, which takes the data table as the first argument and the conditional statement as the second. As with mutate, we can use the unquoted variable names from diabetes inside the function, and it will know we mean the columns and not objects in the workspace.

# filtering through BMI
filter(diabetes, BMI >= 23)
## # A tibble: 707 × 10
##    Pregnan…¹ Glucose Blood…² SkinT…³ Insulin   BMI Diabe…⁴   Age Outcome geria…⁵
##        <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>
##  1         6     148      72      35       0  33.6   0.627    50       1       1
##  2         1      85      66      29       0  26.6   0.351    31       0       0
##  3         8     183      64       0       0  23.3   0.672    32       1       0
##  4         1      89      66      23      94  28.1   0.167    21       0       0
##  5         0     137      40      35     168  43.1   2.29     33       1       0
##  6         5     116      74       0       0  25.6   0.201    30       0       0
##  7         3      78      50      32      88  31     0.248    26       1       0
##  8        10     115       0       0       0  35.3   0.134    29       0       0
##  9         2     197      70      45     543  30.5   0.158    53       1       1
## 10         4     110      92       0       0  37.6   0.191    30       0       0
## # … with 697 more rows, and abbreviated variable names ¹​Pregnancies,
## #   ²​BloodPressure, ³​SkinThickness, ⁴​DiabetesPedigreeFunction, ⁵​geriatric
## # ℹ Use `print(n = ...)` to see more rows
# store the filtered rows in a new object
BMI <- filter(diabetes, BMI >= 23)
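
filter also accepts more than one condition. The sketch below uses the built-in mtcars data set as a stand-in for diabetes, and the object names and cutoffs are chosen only for illustration.

```r
library(dplyr)

# conditions separated by commas must all be true (logical AND)
light_efficient <- filter(mtcars, mpg >= 23, wt < 2)

# use | when meeting either condition is enough (logical OR)
either <- filter(mtcars, mpg >= 23 | wt < 2)

nrow(light_efficient)
nrow(either)
```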

3.3 Selecting specific data variables

Although our data table only has 9 columns, some data tables include hundreds. If we want to view just a few, we can use the dplyr select function. In the code below we select three columns, assign the result to a new object, and then filter the new object.

new_diabetes <- select(diabetes, Age, BMI, Glucose)
filter(new_diabetes, BMI >=23)
## # A tibble: 707 × 3
##      Age   BMI Glucose
##    <dbl> <dbl>   <dbl>
##  1    50  33.6     148
##  2    31  26.6      85
##  3    32  23.3     183
##  4    21  28.1      89
##  5    33  43.1     137
##  6    30  25.6     116
##  7    26  31        78
##  8    29  35.3     115
##  9    53  30.5     197
## 10    30  37.6     110
## # … with 697 more rows
## # ℹ Use `print(n = ...)` to see more rows
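
Besides listing columns by name, select understands a few shortcuts. The sketch below uses the built-in mtcars data set (an assumption, so the code runs without diabetes.csv) and also chains select and filter with the pipe instead of saving an intermediate object.

```r
library(dplyr)

select(mtcars, mpg:disp) |> head(3)          # a range of adjacent columns
select(mtcars, -carb) |> head(3)             # every column except carb
select(mtcars, starts_with("d")) |> head(3)  # columns whose names start with "d"

# select and filter chained in one pipeline
small <- mtcars |>
  select(mpg, wt, cyl) |>
  filter(mpg >= 23)
small
```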
# to sort this new dataset by Age from youngest to oldest, use arrange(); tail() then shows the last six rows (the oldest patients)

new_diabetes |> 
  arrange(Age) |>
  tail()
## # A tibble: 6 × 3
##     Age   BMI Glucose
##   <dbl> <dbl>   <dbl>
## 1    68  35.6      91
## 2    69  26.8     132
## 3    69   0       136
## 4    70  32.5     145
## 5    72  19.6     119
## 6    81  25.9     134
# for descending order of Age, wrap the variable in desc()
new_diabetes |> 
  arrange(desc(Age)) |>
  head()
## # A tibble: 6 × 3
##     Age   BMI Glucose
##   <dbl> <dbl>   <dbl>
## 1    81  25.9     134
## 2    72  19.6     119
## 3    70  32.5     145
## 4    69  26.8     132
## 5    69   0       136
## 6    68  35.6      91
# to group by a specific variable (in this case, geriatric), use group_by(); the rows stay the same, but the output now reports the groups

diabetes_v2 |> group_by(geriatric)
## # A tibble: 768 × 10
## # Groups:   geriatric [2]
##    Pregnan…¹ Glucose Blood…² SkinT…³ Insulin   BMI Diabe…⁴   Age Outcome geria…⁵
##        <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>
##  1         6     148      72      35       0  33.6   0.627    50       1       1
##  2         1      85      66      29       0  26.6   0.351    31       0       0
##  3         8     183      64       0       0  23.3   0.672    32       1       0
##  4         1      89      66      23      94  28.1   0.167    21       0       0
##  5         0     137      40      35     168  43.1   2.29     33       1       0
##  6         5     116      74       0       0  25.6   0.201    30       0       0
##  7         3      78      50      32      88  31     0.248    26       1       0
##  8        10     115       0       0       0  35.3   0.134    29       0       0
##  9         2     197      70      45     543  30.5   0.158    53       1       1
## 10         8     125      96       0       0   0     0.232    54       1       1
## # … with 758 more rows, and abbreviated variable names ¹​Pregnancies,
## #   ²​BloodPressure, ³​SkinThickness, ⁴​DiabetesPedigreeFunction, ⁵​geriatric
## # ℹ Use `print(n = ...)` to see more rows
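
Grouping becomes useful once it is followed by a function that works group by group, such as summarise. The sketch below uses the built-in mtcars data set grouped by cyl as a stand-in for diabetes_v2 grouped by geriatric; the summary column names n and mean_mpg are chosen for illustration.

```r
library(dplyr)

by_cyl <- mtcars |>
  group_by(cyl) |>
  summarise(
    n        = n(),        # number of cars in each group
    mean_mpg = mean(mpg)   # average miles per gallon in each group
  )
by_cyl
```

The result has one row per group instead of one row per car.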
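# case_when is a vectorized version of ifelse that handles more than two outcomes. It checks the conditions in order and returns the value for the first one that is TRUE, which makes it useful for creating or defining categorical variables within our dataset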

x <- c(-2,-1,0,1,2)
case_when(x < 0 ~ "Negative",
          x > 0 ~ "Positive",
          TRUE ~ "Zero")
## [1] "Negative" "Negative" "Zero"     "Positive" "Positive"
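
case_when is most often used inside mutate to build a categorical column. The sketch below assumes the built-in mtcars data set, and the column name power and the cutoffs of 100 and 200 are made up for illustration.

```r
library(dplyr)

cars_power <- mutate(mtcars,
                     power = case_when(hp < 100 ~ "low",
                                       hp < 200 ~ "medium",  # reached only when hp >= 100
                                       TRUE     ~ "high"))   # everything else

table(cars_power$power)
```

Because the conditions are checked in order, the second condition does not need to repeat hp >= 100.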

4.0 Plots and Visualization

The following section will be taught a little differently. The code chunks will be provided and you will follow along and program with me throughout the activities as I explain what each function does.

library(dplyr)
library(ggplot2) # we load both packages necessary to begin plotting

ggplot(data = diabetes)
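
Calling ggplot with only a data argument draws an empty canvas, because no variables have been mapped to the axes and no layers have been added. A minimal sketch of a complete plot, assuming the built-in mtcars data set rather than diabetes, could look like this:

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +  # map columns to the axes
  geom_point() +                                    # add a layer of points
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
p
```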

# the datasets package ships with R, so it does not need to be installed
library(datasets)
data("mtcars")
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# a one-dimensional plot shows one single variable at a time
boxplot(mtcars$mpg, col= "green")

hist(mtcars$mpg, col = "green", breaks = 25)  ## Plot 2

hist(mtcars$mpg, col = "green", breaks = 50)  ## Plot 3

barplot(table(mtcars$carb), col="grey")

# Two dimensional plots
boxplot(mpg~wt, data=mtcars, col = "grey")

hist(subset(mtcars, cyl == 4)$mpg, col = "green") 

with(mtcars, plot(wt, mpg))

# Using the plot function in R
plot(3, 4)

plot(c(1, 3, 4), c(4, 5 , 8))

plot(1:20)

# Values for x and y axis
x <- 1:5; y <- x * x
 
# Using plot() function
plot(x, y, type = "l") # l stands for lines

plot(x, y, type = "h") # h stands for histogram-like vertical lines

# R program to plot a graph

# Creating x and y-values
x <- 1:5; y <- x * x
# Using plot function
plot(x, y, type = "b") # b stands for both points and lines

plot(x, y, type = "s") # s stands for stair steps

plot(x, y, type = "p") # p stands for points
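
The plot function also accepts arguments for a title, axis labels, and color. The sketch below reuses the same x and y values; the title and color are chosen only for illustration.

```r
x <- 1:5
y <- x * x

plot(x, y,
     type = "b",              # both points and lines
     main = "y = x squared",  # plot title
     xlab = "x",              # x-axis label
     ylab = "y",              # y-axis label
     col  = "blue")           # color of the points and lines
```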

The following chunk of code is an example of how loops, random sampling, and lists can be combined to create beautiful plots. Credit to https://towardsdatascience.com/christmas-cards-81e7e1cce21c for showcasing this code.

# install.packages("plotly")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
set.seed(24)

n_tree <- 1000
n_ornaments <- 20
n_lights <- 300

# Generate spiral data points
x <- c()
y <- c()
z <- c()

for (i in 1:n_tree) {
    r <- i / 30
    x <- c(x, r * cos(i / 30))
    y <- c(y, r * sin(i / 30))
    z <- c(z, n_tree - i)
}

tree <- data.frame(x, y, z)

# Sample for ornaments:
#   - sample n_ornaments points from the tree spiral
#   - modify z so that the ornaments are below the line
#   - color column: optional, add if you want to add color range to ornaments
ornaments <- tree[sample(nrow(tree), n_ornaments), ]
ornaments$z <- ornaments$z - 50
ornaments$color <- 1:nrow(ornaments)

# Sample for lights:
#   - sample n_lights points from the tree spiral
#   - Add normal noise to x, y, and z so the lights spread out
lights <- tree[sample(nrow(tree), n_lights), ]
lights$x <- lights$x + rnorm(n_lights, 0, 20)
lights$y <- lights$y + rnorm(n_lights, 0, 20)
lights$z <- lights$z + rnorm(n_lights, 0, 20)

# hide axes
ax <- list(
    title = "",
    zeroline = FALSE,
    showline = FALSE,
    showticklabels = FALSE,
    showgrid = FALSE
)

plot_ly() %>%
    add_trace(data = tree, x = ~x, y = ~y, z = ~z,
              type = "scatter3d", mode = "lines",
              line = list(color = "#1A8017", width = 7)) %>%
    add_markers(data = ornaments, x = ~x, y = ~y, z = ~z,
                type = "scatter3d",
                marker = list(color = ~color,
                              colorscale = list(c(0,'#EA4630'), c(1,'#CF140D')),
                              size = 15)) %>%
    add_markers(data = lights, x = ~x, y = ~y, z = ~z,
                type = "scatter3d",
                marker = list(color = "#FDBA1C", size = 3, opacity = 0.8)) %>%
    layout(scene = list(xaxis=ax, yaxis=ax, zaxis=ax), showlegend = FALSE)
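
The for loop that generates the spiral grows x, y, and z one element at a time. Because R arithmetic works on whole vectors, the same points can also be computed without a loop. This is a sketch of an alternative, not the credited author's code, and the object name tree_vec is an assumption chosen for illustration.

```r
n_tree <- 1000

i <- 1:n_tree   # all loop indices at once
r <- i / 30     # radius grows with the index

tree_vec <- data.frame(
  x = r * cos(i / 30),
  y = r * sin(i / 30),
  z = n_tree - i
)
head(tree_vec)
```

The vectorized version avoids copying the vectors on every iteration, which matters more as n_tree grows.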